!jupyter nbconvert --to html ""
Unsupervised Learning: Trade & Ahead¶
Problem Statement¶
Context¶
The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Objective¶
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
Data Dictionary¶
- Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
- Company: Name of the company
- GICS Sector: The specific economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
- GICS Sub Industry: The specific sub-industry group assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
- Current Price: Current stock price in dollars
- Price Change: Percentage change in the stock price in 13 weeks
- Volatility: Standard deviation of the stock price over the past 13 weeks
- ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is equal to a company's assets minus its debt)
- Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
- Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
- Net Income: Revenues minus expenses, interest, and taxes (in dollars)
- Earnings Per Share: Company's net profit divided by the number of common shares it has outstanding (in dollars)
- Estimated Shares Outstanding: Company's stock currently held by all its shareholders
- P/E Ratio: Ratio of the company's current stock price to the earnings per share
- P/B Ratio: Ratio of the company's stock price per share by its book value per share (book value of a company is the net difference between that company's total assets and total liabilities)
Importing necessary libraries and data¶
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.8.1 numpy==1.25.2 pandas==2.2.2 yellowbrick==1.5 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.8.1 numpy==1.25.2 pandas==2.2.2 yellowbrick==1.5 -q --user
!pip install --upgrade -q jinja2
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import pdist, cdist
# to perform k-means clustering, compute metric
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
#!pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform PCA
from sklearn.decomposition import PCA
Data Overview¶
- Observations
- Sanity checks
# loading the dataset
data = pd.read_csv("stock_data.csv")
data.shape
(340, 15)
data.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
data.tail()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 939457327.6 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.4 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 188461538.5 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 257892500.0 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 498529411.8 | 70.470585 | 1.723068 |
df = data.copy()
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
- Ticker Symbol, Security, GICS Sector & GICS Sub Industry are objects and the others are float or integer type
- There are not any NULL values in the dataset
# making column names easier
df.columns = [c.replace(" ", "_") for c in df.columns]
# Dropping Ticker Symbol due to security being a clearer title for similar descriptions
df.drop("Ticker_Symbol", axis=1, inplace=True)
# lets check duplicate observations
df.duplicated().sum()
0
# lets check total null values
df.isnull().sum().sum()
0
def check_skewness(df):
# Loop through each column in the DataFrame
for column in df.select_dtypes(include=['float64', 'int64']).columns:
median = df[column].median()
mean = df[column].mean()
# Check if the median is greater than or less than the mean
if median > mean:
skew_direction = "skewed to the left"
elif median < mean:
skew_direction = "skewed to the right"
else:
skew_direction = "symmetric"
# Print the column name and the skewness direction
print(f"{'- '}{column} is {skew_direction}")
# Example usage
# df.describe(include="all").T is used for full summary, but in this function we only need the numeric columns
check_skewness(df)
- Current_Price is skewed to the right - Price_Change is skewed to the left - Volatility is skewed to the right - ROE is skewed to the right - Cash_Ratio is skewed to the right - Net_Cash_Flow is skewed to the right - Net_Income is skewed to the right - Earnings_Per_Share is skewed to the left - Estimated_Shares_Outstanding is skewed to the right - P/E_Ratio is skewed to the right - P/B_Ratio is skewed to the left
#Looking at the statistical summary of the data
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS_Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS_Sub_Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current_Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price_Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash_Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net_Cash_Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net_Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327937 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings_Per_Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated_Shares_Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E_Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B_Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
- There are 340 security, eleven sectors, and 104 subsectors
- Current_Price is skewed to the right
- Price_Change is skewed to the left
- Volatility is skewed to the right
- ROE is skewed to the right
- Cash_Ratio is skewed to the right
- Net_Cash_Flow is skewed to the right
- Net_Income is skewed to the right
- Earnings_Per_Share is skewed to the left
- Estimated_Shares_Outstanding is skewed to the right
- P/E_Ratio is skewed to the right
- P/B_Ratio is skewed to the left
Exploratory Data Analysis (EDA)¶
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- What does the distribution of stock prices look like?
- Skewed to the right with outliers on the right side
- The stocks of which economic sector have seen the maximum price increase on average?
- Healthcare and information technology
- How are the different variables correlated with each other?
- Price_Change is negativly correlated with Volatility, the higher the volitility the higher the price change. While Volitility has a per share has a negative correlation with Earnings per share.
- Earnings_Per_Share has a positive correlation with Current_Price & Net_Income, meaning
- Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
Informational TEchnology has some of the highest outliers, while Information Sector and Finacials have high median cas_ ratios. Real Esatte and Financials ahve the minimum Cash Ration variance.
- P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors? Energy has the highest cariance, people are more willing to invest $ into a share.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# numerical columns only
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
histogram_boxplot(df, numeric_columns[0], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[1], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[2], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[3], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[4], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[5], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[6], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[7], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
histogram_boxplot(df, numeric_columns[8], bins=50, kde=True, figsize=(10, 5))
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) <ipython-input-16-f3b56adba9bd>:23: UserWarning: Ignoring `palette` because no `hue` variable has been assigned. sns.histplot(
Description given earlier
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(df, "GICS_Sector", perc=True)
<ipython-input-27-1e58b03272cc>:22: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. ax = sns.countplot( /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key) /root/.local/lib/python3.10/site-packages/seaborn/_base.py:949: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning. data_subset = grouped_data.get_group(pd_key)
pd.crosstab(df.GICS_Sub_Industry, df.GICS_Sector).style.highlight_max(
color="lightblue", axis=0
)
| GICS_Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GICS_Sub_Industry | |||||||||||
| Advertising | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Aerospace & Defense | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Agricultural Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Air Freight & Logistics | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Airlines | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Alternative Carriers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Apparel, Accessories & Luxury Goods | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Application Software | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Asset Management & Custody Banks | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Auto Parts & Equipment | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Automobile Manufacturers | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Banks | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Biotechnology | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| Brewers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Broadcasting & Cable TV | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Building Products | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Cable & Satellite | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Casinos & Gaming | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Computer Hardware | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Construction & Farm Machinery & Heavy Trucks | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Construction Materials | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Consumer Electronics | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Consumer Finance | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Copper | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Data Processing & Outsourced Services | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Distributors | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Diversified Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| Diversified Commercial Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Diversified Financial Services | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Drug Retail | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Electric Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| Electrical Components & Equipment | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Electronic Components | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Electronic Equipment & Instruments | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Environmental Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Fertilizers & Agricultural Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Financial Exchanges & Data | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gold | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Health Care Distributors | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Equipment | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Facilities | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Supplies | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Home Entertainment Software | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Home Furnishings | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Homebuilding | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hotels, Resorts & Cruise Lines | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Appliances | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Products | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Housewares & Specialties | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Resource & Employment Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| IT Consulting & Other Services | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| Industrial Conglomerates | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 |
| Industrial Gases | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Industrial Machinery | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Industrial Materials | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Insurance Brokers | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Oil & Gas | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Telecommunications Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| Internet & Direct Marketing Retail | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Internet Software & Services | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 |
| Investment Banking & Brokerage | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Leisure Products | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life & Health Insurance | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life Sciences Tools & Services | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Managed Health Care | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Metal & Glass Containers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Motorcycle Manufacturers | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-Sector Holdings | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-line Insurance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MultiUtilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| Networking Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Office REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Oil & Gas Equipment & Services | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Exploration & Production | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Refining & Marketing & Transportation | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packaged Foods & Meats | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Paper Packaging | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Personal Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pharmaceuticals | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| Property & Casualty Insurance | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Publishing | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| Railroads | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Real Estate Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Regional Banks | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Research & Consulting Services | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Residential REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Restaurants | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Retail REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Semiconductor Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Semiconductors | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 |
| Soft Drinks | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialized REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| Specialty Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
| Specialty Retail | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialty Stores | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Steel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Technology Hardware, Storage & Peripherals | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Technology, Hardware, Software and Supplies | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Thrifts & Mortgage Finance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tires & Rubber | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tobacco | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Trucking | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Water Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Bivariate Analysis
# Stock Price_Change Vs. GICS_Sector
plt.figure(figsize=(20,6))
sns.boxplot(data = df, y = "Price_Change", x = "GICS_Sector");
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float)
# Cash_Ratio Vs. GICS_Sector
plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "Cash_Ratio", x = "GICS_Sector");
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float)
# P/E_Ratio Vs. GICS_Sector
plt.figure(figsize=(20,8))
sns.boxplot(data = df, y = "P/E_Ratio", x = "GICS_Sector");
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float)
# check for correlations
plt.figure(figsize=(15, 7))
sns.heatmap(
df[numeric_columns].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# Pair-plot
sns.pairplot(df[numeric_columns],diag_kind="kde");
There seems to be relatively low correlation between all of the numerica columns.
Data Preprocessing¶
- Duplicate value check
- Missing value treatment
- Outlier check
- Feature engineering (if needed)
- Any other preprocessing steps (if needed)
# Scaling the data to bring it to the same scale
sc = StandardScaler()
subset_scaled_df = pd.DataFrame(
sc.fit_transform(df.drop(["Security", "GICS_Sector", "GICS_Sub_Industry"], axis=1)),
columns=df.drop(["Security", "GICS_Sector", "GICS_Sub_Industry"], axis=1).columns,
)
subset_scaled_df.head()
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
# Pairplot analysis
sns.pairplot(subset_scaled_df ,diag_kind="kde");
K-means Clustering¶
clusters = range(1, 9)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k)
model.fit(subset_scaled_df)
prediction = model.predict(subset_scaled_df)
distortion = (
sum(
np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
)
/ subset_scaled_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Selecting k with the Elbow Method")
plt.show()
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters:
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
3 Average Distortion: 2.2659465936501304 Number of Clusters: 4 Average Distortion: 2.179645269703779 Number of Clusters: 5 Average Distortion: 2.108708268456352 Number of Clusters: 6 Average Distortion: 2.0737104927328227 Number of Clusters: 7 Average Distortion: 2.0373608736967674 Number of Clusters: 8 Average Distortion: 1.977282084957395
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
The appriate k through the Elbow method would be 4 or 5 due to thats when the slope takes a huge dip in inertia.
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters)
preds = clusterer.fit_predict((subset_scaled_df))
# centers = clusterer.cluster_centers_
score = silhouette_score(subset_scaled_df, preds)
sil_score.append(score)
print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
For n_clusters = 2, the silhouette score is 0.43969639509980457) For n_clusters = 3, the silhouette score is 0.46179611378177166) For n_clusters = 4, the silhouette score is 0.45434371948348606) For n_clusters = 5, the silhouette score is 0.4485295447724099)
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
For n_clusters = 6, the silhouette score is 0.39370682191208695) For n_clusters = 7, the silhouette score is 0.4268026342866841)
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn( /root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
For n_clusters = 8, the silhouette score is 0.39475222752400196) For n_clusters = 9, the silhouette score is 0.40556159035563283)
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
From Sihouette score, 4 seems to be the correct choice of clusters
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
#want 4 as our cluster
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(subset_scaled_df)
/root/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning warnings.warn(
KMeans(n_clusters=4, random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=4, random_state=0)
df['K_means_segments'] = kmeans.labels_
subset_scaled_df['K_means_segments'] = kmeans.labels_
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
# Group by clusters and calculate the mean for numeric columns only
cluster_profile = numeric_df.groupby(df['K_means_segments']).mean()
cluster_profile['count_in_each_segments'] = df.groupby('K_means_segments')['Security'].count().values
cluster_profile
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | |||||||||||||
| 0 | 234.170932 | 13.400685 | 1.729989 | 25.600000 | 277.640000 | 1.554927e+09 | 1.572612e+09 | 6.045200 | 5.783163e+08 | 74.960824 | 14.402452 | 0.0 | 25 |
| 1 | 38.099260 | -15.370329 | 2.910500 | 107.074074 | 50.037037 | -1.594285e+08 | -3.887458e+09 | -9.473704 | 4.803986e+08 | 90.619220 | 1.342067 | 1.0 | 27 |
| 2 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1.072273e+09 | 1.483309e+10 | 4.154545 | 4.298827e+09 | 14.803577 | -4.552119 | 2.0 | 11 |
| 3 | 72.399112 | 5.066225 | 1.388319 | 34.620939 | 53.000000 | -1.404622e+07 | 1.482212e+09 | 3.621029 | 4.385338e+08 | 23.843656 | -3.358948 | 3.0 | 277 |
fig, axes = plt.subplots(3, 4, figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)
counter = 0
for ii in range(3):
for jj in range(4):
if counter <11:
sns.boxplot(ax=axes[ii, jj],y=subset_scaled_df[numeric_columns[counter]],x=subset_scaled_df['K_means_segments'])
counter = counter+1
fig.tight_layout(pad=2.0)
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float)
Insights
- Cluster 0
- has about 276 different securities
-Cluster 1
- has about 26
- has the lowest current price
- has the most negative price volitility
- has the highest price volitilty
- highest ROE
- lowest cash ratio
- CLuster 2
- has about 25
- has the highest P/E and P/B ratio
- has the highest current price
- has the most positive price volitily
- lowest ROE
- highest cash ratio
- Cluster 3
- has about 13
- has the lowest P/E and P/B ratio
- has the lowest volitality
Clusters 0 and 1 are the safe clusters, cluster 0 has less number o securities. The other clusters have more risk, though 2 has a high cash ratio
# Comparing cluster vs. GICS_Sector
pd.crosstab(df.GICS_Sector, df.K_means_segments).style.highlight_max(color = 'lightblue', axis = 0)
| K_means_segments | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| GICS_Sector | ||||
| Consumer Discretionary | 6 | 0 | 1 | 33 |
| Consumer Staples | 1 | 0 | 1 | 17 |
| Energy | 1 | 22 | 1 | 6 |
| Financials | 1 | 0 | 3 | 45 |
| Health Care | 9 | 0 | 2 | 29 |
| Industrials | 0 | 1 | 0 | 52 |
| Information Technology | 5 | 3 | 1 | 24 |
| Materials | 0 | 1 | 0 | 19 |
| Real Estate | 1 | 0 | 0 | 26 |
| Telecommunications Services | 1 | 0 | 2 | 2 |
| Utilities | 0 | 0 | 0 | 24 |
In cluster 0 most of the securities are the industrials then financials in cluster 1 most of them are enegry followed by information tech in cluster 2 most of them are health care folowed by consumer discretionary and information technology In cluster 3 fincnails at three are the majority of securities
Hierarchical Clustering¶
#distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
#linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9304469769832865 Cophenetic correlation for Euclidean distance and complete linkage is 0.8559480642212798 Cophenetic correlation for Euclidean distance and average linkage is 0.946403836884538 Cophenetic correlation for Euclidean distance and weighted linkage is 0.7508819056084053 Cophenetic correlation for Chebyshev distance and single linkage is 0.9161627445317929 Cophenetic correlation for Chebyshev distance and complete linkage is 0.822502094153258 Cophenetic correlation for Chebyshev distance and average linkage is 0.9379218754329659 Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9153206618543516 Cophenetic correlation for Mahalanobis distance and single linkage is 0.9348505176633238 Cophenetic correlation for Mahalanobis distance and complete linkage is 0.6881861661402056 Cophenetic correlation for Mahalanobis distance and average linkage is 0.9360657692078034 Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8810701545336993 Cophenetic correlation for Cityblock distance and single linkage is 0.938373245895409 Cophenetic correlation for Cityblock distance and complete linkage is 0.8124007660644492 Cophenetic correlation for Cityblock distance and average linkage is 0.9168123859372297 Cophenetic correlation for Cityblock distance and weighted linkage is 0.866729262879581
# distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which can be gathered from the sum of {} distance and {} linkage".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.946403836884538, which can be gathered from the sum of Euclidean distance and average linkage
# linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print("Cophenetic correlation for {} linkage is {}".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9304469769832865 Cophenetic correlation for complete linkage is 0.8559480642212798 Cophenetic correlation for average linkage is 0.946403836884538 Cophenetic correlation for centroid linkage is 0.9494262703881242 Cophenetic correlation for ward linkage is 0.7436374975239648 Cophenetic correlation for weighted linkage is 0.7508819056084053
# combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which which can be gathered from the sum of {} linkage".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9494262703881242, which which can be gathered from the sum of centroid linkage
#linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# saved results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# For each linkage method, plot the dendrogram, calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(subset_scaled_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
- Cophenetic correlation is the highest with Centroid Linkage, Average Linkage & Complete and Weighted Linkage
- Using Centroid Linkaid the number 6 is the appriate number of clusters from the dendrogram
from sklearn.cluster import AgglomerativeClustering
# Corrected model with metric
HCmodel = AgglomerativeClustering(n_clusters=6, metric="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=6)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=6)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
# Group by the clustering column and calculate the mean for numeric columns only
cluster_profile = numeric_df.groupby(df["HC_Clusters"]).mean()
cluster_profile["count_in_each_segments"] = (
df.groupby("HC_Clusters")["Security"].count().values
)
cluster_profile
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | HC_Clusters | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | ||||||||||||||
| 0 | 77.287589 | 4.099730 | 1.518066 | 35.336336 | 66.900901 | -3.319732e+07 | 1.538075e+09 | 2.88527 | 5.605050e+08 | 32.441706 | -2.174921 | 2.630631 | 0.0 | 333 |
| 1 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 1.675550e+10 | 1.365400e+10 | 3.29500 | 2.791829e+09 | 13.649696 | 1.508484 | 1.000000 | 1.0 | 2 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1.292500e+09 | -1.910650e+10 | -41.81500 | 5.195740e+08 | 60.748608 | 1.565141 | 1.000000 | 2.0 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 5.920000e+08 | 3.669000e+09 | 1.31000 | 2.800763e+09 | 79.893133 | 5.884467 | 0.000000 | 3.0 | 1 |
| 4 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1.671386e+09 | 2.551360e+09 | 50.09000 | 5.093552e+07 | 25.453183 | -1.052429 | 0.000000 | 4.0 | 1 |
| 5 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 9.088500e+07 | 5.965410e+08 | 8.91000 | 6.695185e+07 | 31.040405 | 129.064585 | 0.000000 | 5.0 | 1 |
- There are five clusters, with less than three securities. This is not a good set due to the lack of variability. Should look for a different linkage, if I use ward the number of four clusters seems to be the appriate number.
from sklearn.cluster import AgglomerativeClustering
# Corrected model with metric
HCmodel = AgglomerativeClustering(n_clusters=4, metric="euclidean", linkage="ward")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(metric='euclidean', n_clusters=4)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(metric='euclidean', n_clusters=4)
AgglomerativeClustering(n_clusters=4)
AgglomerativeClustering(n_clusters=4)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
AgglomerativeClustering(n_clusters=4)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
cluster_profile = numeric_df.groupby(df["HC_Clusters"]).mean()
cluster_profile["count_in_each_segments"] = (
df.groupby("HC_Clusters")["Security"].count().values
)
cluster_profile
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_segments | HC_Clusters | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | ||||||||||||||
| 0 | 46.558126 | -11.798670 | 2.617878 | 178.750000 | 50.250000 | 4.349716e+07 | -3.197472e+09 | -7.785312 | 4.732895e+08 | 72.496532 | -0.780467 | 1.281250 | 0.0 | 32 |
| 1 | 71.846974 | 4.953643 | 1.392784 | 25.117216 | 53.831502 | 1.197788e+06 | 1.557674e+09 | 3.691044 | 4.439183e+08 | 23.583804 | -3.087957 | 2.996337 | 1.0 | 273 |
| 2 | 229.579357 | 14.049986 | 1.735216 | 25.423077 | 268.423077 | 1.712688e+09 | 1.981882e+09 | 5.946923 | 7.219242e+08 | 84.216911 | 13.114240 | 0.115385 | 2.0 | 26 |
| 3 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3.040667e+09 | 1.484844e+10 | 3.435556 | 4.564960e+09 | 15.596051 | -6.354193 | 2.000000 | 3.0 | 9 |
#The names of the securities within each cluster
for cl in df["HC_Clusters"].unique():
print(
"The",
df[df["HC_Clusters"] == cl]["Security"].nunique(),
"Securities in cluster",
cl,
"are:",
)
print(df[df["HC_Clusters"] == cl]["Security"].unique())
print("-" * 100, "\n")
The 273 Securities in cluster 1 are: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] ---------------------------------------------------------------------------------------------------- The 26 Securities in cluster 2 are: ['Analog Devices, Inc.' 'Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc' 'Amazon.com Inc' 'Bank of America Corp' 'BIOGEN IDEC Inc.' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Intel Corp.' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] ---------------------------------------------------------------------------------------------------- The 32 Securities in cluster 0 are: ['Allegion' 'Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kimberly-Clark' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'S&P Global, Inc.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] ---------------------------------------------------------------------------------------------------- The 9 Securities in cluster 3 are: ['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.'] ----------------------------------------------------------------------------------------------------
fig, axes = plt.subplots(3, 4, figsize=(20, 16))
fig.suptitle('Boxplot of numerical variables for each cluster', fontsize=20)
counter = 0
for ii in range(3):
for jj in range(4):
if counter <11:
sns.boxplot(ax=axes[ii, jj],y=subset_scaled_df[numeric_columns[counter]],x=subset_scaled_df['HC_Clusters'])
counter = counter+1
fig.tight_layout(pad=2.0)
/root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float) /root/.local/lib/python3.10/site-packages/seaborn/categorical.py:640: FutureWarning: SeriesGroupBy.grouper is deprecated and will be removed in a future version of pandas. positions = grouped.grouper.result_index.to_numpy(dtype=float)
#Cluster vs. GICS_Sector
pd.crosstab(df.GICS_Sector, df.HC_Clusters).style.highlight_max(color = 'lightblue', axis = 0)
| HC_Clusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| GICS_Sector | ||||
| Consumer Discretionary | 1 | 32 | 6 | 1 |
| Consumer Staples | 2 | 15 | 1 | 1 |
| Energy | 23 | 6 | 0 | 1 |
| Financials | 1 | 44 | 1 | 3 |
| Health Care | 0 | 30 | 9 | 1 |
| Industrials | 2 | 51 | 0 | 0 |
| Information Technology | 2 | 24 | 7 | 0 |
| Materials | 1 | 19 | 0 | 0 |
| Real Estate | 0 | 26 | 1 | 0 |
| Telecommunications Services | 0 | 2 | 1 | 2 |
| Utilities | 0 | 24 | 0 | 0 |
- Insights
- the grouping obtained with Hierachical clustering with Eucledian Distance and Ward Linkage is similar to K means clustering
-CLustr 0 of HC is similar to CLuster 1 of KMeans, Cluster 1 of HC is similar to CLuster 3 of KMeans, CLuster 2 HC is similar to 0 of Kmeans and Cluster 3 is similar to Cluster 2 of KMeans
K-means vs Hierarchical Clustering¶
You compare several things, like:
- Which clustering technique took less time for execution?
- Which clustering technique gave you more distinct clusters, or are they the same?
- How many observations are there in the similar clusters of both algorithms?
- How many clusters are obtained as the appropriate number of clusters from both algorithms?
You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.
#Hierarchical cluster vs. K-means cluster
pd.crosstab(df.K_means_segments, df.HC_Clusters).style.highlight_max(color = 'lightblue', axis = 0)
| HC_Clusters | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| K_means_segments | ||||
| 0 | 1 | 0 | 24 | 0 |
| 1 | 26 | 0 | 1 | 0 |
| 2 | 0 | 1 | 1 | 9 |
| 3 | 5 | 272 | 0 | 0 |
The four clusters are able to group 327 of the securities
# PCA to reduce the data to two dimensions and visualize it
# set number of components to 2
pca = PCA(n_components=2)
# transform data and store results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
data=X_reduced_pca, columns=["Component 1", "Component 2"]
)
#variance explained
pca.explained_variance_ratio_.sum()
0.38933325456234846
##KMeans Clustering
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df["K_means_segments"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x7af38e03cfa0>
##Hierachical CLustering Scatterplot
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df["HC_Clusters"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x7af391ba15a0>
Generally the Hierachical Clustering and K Means clustering are similar, and in each one cluster seems to be near the other three clusters data points
Actionable Insights and Recommendations¶
-Energy has the highest variance, though the secutiry has high P/E ations. THis indicates that even though they are more volitile, investors invest in the shares more than others.
-Information Tech and Finance have the highest cash rations meaning theyr e more prefered compared to others.
Healthcare an Finace have seen the highest price change meaning more people have been buying into these sectors
- The cluster with a majority of the securities are safe investments given the research. THis cluster has Financials, Real Estate & Informational Technology sectors.
Recommendations
- The stock market is always volitile, though given the way that the clusters have been identified were between volitile high performing and low perfoming, aggresive and high performing and relatively mild options.